Abstract

Motivation: Function of proteins or a network of interacting proteins often involves com- 
munication between residues that are well separated in sequence. The classic example is the 
participation of distant residues in allosteric regulation. Bioinformatic and structural analysis 
methods have been introduced to infer residues that are correlated. Recently, increasing attention
has been paid to the sequence properties that determine the tendency of disease-related
proteins (Aβ peptides, prion proteins, transthyretin, etc.) to aggregate and form fibrils.
Motivated in part by the need to identify sequence characteristics that indicate a tendency to
aggregate, we introduce a general method that probes covariations in charged residues along
the sequence in a given protein family. The method, which involves computing the Sequence
Correlation Entropy (SCE) using the quenched probability $P_{s_k}(i,j)$ of finding a residue pair
at a given sequence separation $s_k$, allows us to classify protein families in terms of their SCE.
Our general approach may be a useful way of obtaining evolutionary covariations of amino acid
residues on a genome-wide level.

Results: We use a combination of SCE and clustering based on principal component analysis
to classify the protein families. From an analysis of 839 families, covering approximately 500,000 
sequences, we find that proteins with relatively low values of SCE are predominantly associated 
with various diseases. In several families, residues that give rise to peaks in $P_{s_k}(i,j)$ are clustered
in the three dimensional structure. For the class of proteins with low SCE values there are
significant numbers of mixed charged-hydrophobic (CH) and charged-polar (CP) runs. Our
findings suggest that low values of SCE and the presence of (CH) and/or (CP) may be indicative 
of disease association or tendency to aggregate. Our results lead to the hypothesis that functions 
of proteins with similar SCE values may be linked. The hypothesis is validated with a few 
anecdotal examples. The present results also lead to the prediction that the overall charge 
correlations in proteins affect the kinetics of amyloid formation - a feature that is common to 
all proteins implicated in neurodegenerative diseases. 

Introduction 

The classification of proteins based on their structures into families is useful not only in 
assigning distinct functions but also for examining the evolution of sequences with related functions.
Because proteins in a family are descendants of the same ancestral protein, it is logical to
postulate that the observed sequence differences are the result of evolutionary pressures, which
vary greatly across distinct organisms. Sequence variations are tempered by the requirements of
native state stability and function. Destabilization of the folded state by mutations at one site 
can be compensated by mutations at distant or nearby sites (Lesk and Chothia, 1980; Neher, 
1994). Thus, it is important to study covariations of amino acids at distinct sites to decipher 
if there is communication between the two, especially as it pertains to function. Long-range 
communications between several residues (both along the sequence and across domains or inter- 
faces in protein complexes) are crucial for biological function. Thus, it might also be necessary 
to introduce methods to infer multi-site variations across sequences in order to understand a 
number of issues in proteomics. 

Correlations between amino acids in protein families have been probed using computational 
methods beginning with the classic works of Lesk and Chothia and Altschuh et al. (Lesk 
and Chothia, 1980; Altschuh et al. 1987). Several studies (Neher, 1994; Taylor and Hatrick, 
1994; Pollock and Taylor, 1997) have discovered relationships between coordinated amino acid 
changes that occur during evolution and the corresponding structural alterations. The working 
hypothesis in these studies is that a mutation at a site that compromises the function is often 
compensated by a mutation at another site that is in proximity in the three dimensional struc- 
ture. The difficulty in validating the working hypothesis arises largely because multi-correlation 
effects, which are difficult to capture (Pollock and Taylor, 1997), can be important in com- 
pensating a given mutation. Nevertheless, the computational methods that capture sequence 
covariations have provided insights in a number of areas of protein science (Landgraf et al., 2001; 
Pazos et al., 1997; Olmea et al., 1999; Fariselli and Casadio, 1999; Lockless and Ranganathan, 
1999; Lichtarge and Sowa, 2002). 

To infer the functional importance of correlated mutations, it is crucial to include physico- 
chemical characteristics of amino acids (charge, volume of side-chains, hydrophobicity, etc.) 
to describe the positions in a multiple sequence alignment (MSA) (Lesk and Chothia, 1980; 
Neher, 1994). Based on a study of divergent evolution in a set of protein families with 
known folds (Chelvanayagam et al., 1997) it has been argued that only charged residues show 
discernible covariation at all evolutionary distances. With these observations in mind, we have
investigated, using a new method, covariations among charged residues in 839 families. To 
obtain such correlations we introduce a function, the Sequence Correlation Entropy (SCE), 
that measures the extent to which two sites along a given sequence are coupled. The values of 
SCE for protein families show that families/functions are associated with specific patterns of 
charges. There is a relationship between the degree of correlation of charged amino acids and 
the disease associations of a family. Families with a high degree of correlation between charged
residues also have many significant mixed charged-hydrophobic/polar runs in the sequences.
These significant findings suggest that charges occur in well defined patterns. Furthermore, 
variations in charges along sequences occur often in a correlated fashion in the evolutionary 
process. 

System and Methods 

Sequence correlation function and the associated "entropy": We introduce a general
measure that probes correlation between specific residues that are separated by a given length
for a database of sequences. Here we focus on charged residues (D and E are negatively charged,
and K and R are positively charged). To measure the correlation along the sequence between
two charged residues $i$ and $j$ ($i, j \in \{+,-\}$), we introduce the Sequence Correlation Entropy (SCE)

$$S_{SCE}(i,j) = -\sum_{k=1}^{l_{max}(i,j)} P_{s_k}(i,j) \ln P_{s_k}(i,j) \qquad (1)$$

where $s_k$ is the sequence separation between the residues, $k$ labels the pairs and $l_{max}(i,j)$
is the total number of sequence separations between residues $i$ and $j$ along the sequences of the
family. We count only those pairs for which the locations of $i$ and $j$ are consecutive, that is,
pairs for which there is no identical pair located between them. The probability of finding
residues $i$ and $j$ at $s_k$ in the MSA is

$$P_{s_k}(i,j) = \frac{1}{n_{seq}(i,j)} \sum_{l=1}^{n_{seq}(i,j)} \frac{n^{(l)}(i,j)[s_k]}{n^{(l)}(i,j)} \qquad (2)$$

where $n_{seq}(i,j)$ is the number of sequences in the MSA, $n^{(l)}(i,j)$ is the number of pairs of
type $(i,j)$ in sequence $l$ and $n^{(l)}(i,j)[s_k]$ is the number of pairs of type $(i,j)$ from sequence $l$ at
separation $s_k$. This equation is meaningful only if the statistical ensemble contains at least one
pair of type $(i,j)$. Note that $P_{s_k}(i,j)$ satisfies the normalization condition
$\sum_{k=1}^{l_{max}(i,j)} P_{s_k}(i,j) = 1$.
Because the SCE uses a "quenched" sum (no preaveraging over all the sequences in the MSA)
over the sequences of a given family, significant correlations, if present, can be gleaned. In
contrast, in the mutual information function the equivalent of Eqn. (2) would be

$$P_{s_k}(i,j) = \frac{\sum_{l=1}^{n_{seq}(i,j)} n^{(l)}(i,j)[s_k]}{\sum_{l=1}^{n_{seq}(i,j)} n^{(l)}(i,j)} \qquad (3)$$

which involves implicit preaveraging over all the sequences in the MSA and might therefore
mask correlations between residues. The SCE depends only on $s_k$ and $P_{s_k}(i,j)$ regardless of
where $i$ and $j$ are located. This accounts for the possibility that, to preserve the function, small
rearrangements in the sequence could be preferred to compensatory changes at fixed positions
along the sequence.
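The difference between the quenched probability and its pre-averaged counterpart can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, and two points are assumptions: a pair counts as "consecutive" when no other charged residue lies between its members, and sequences that contain no pair of the requested type are left out of the average.

```python
import math
from collections import Counter

POSITIVE = set("KR")
NEGATIVE = set("DE")


def charge_of(residue):
    """Return '+', '-' or None for a one-letter amino-acid code."""
    if residue in POSITIVE:
        return "+"
    if residue in NEGATIVE:
        return "-"
    return None


def pair_separations(sequence, pair):
    """Separations between neighbouring charged residues that form `pair`.

    Assumption: a pair is "consecutive" when no other charged residue lies
    between its two members; for ('+', '-') the order along the chain is ignored.
    """
    charged = [(pos, charge_of(res)) for pos, res in enumerate(sequence)
               if charge_of(res) is not None]
    wanted = sorted(pair)
    return [p2 - p1
            for (p1, c1), (p2, c2) in zip(charged, charged[1:])
            if sorted((c1, c2)) == wanted]


def quenched_probability(msa, pair):
    """Normalise the separation counts within each sequence, then average."""
    per_sequence = []
    for seq in msa:
        counts = Counter(pair_separations(seq, pair))
        total = sum(counts.values())
        if total:  # assumption: sequences without the pair are skipped
            per_sequence.append({s: n / total for s, n in counts.items()})
    if not per_sequence:
        raise ValueError("no pair of this type in the alignment")
    prob = Counter()
    for dist in per_sequence:
        for s, p in dist.items():
            prob[s] += p / len(per_sequence)
    return dict(prob)


def pooled_probability(msa, pair):
    """Pre-averaged alternative: pool the counts of all sequences first."""
    counts = Counter()
    for seq in msa:
        counts.update(pair_separations(seq, pair))
    total = sum(counts.values())
    if not total:
        raise ValueError("no pair of this type in the alignment")
    return {s: n / total for s, n in counts.items()}


def sequence_correlation_entropy(prob):
    """Shannon entropy (natural log) of a separation distribution."""
    return -sum(p * math.log(p) for p in prob.values() if p > 0)


toy_msa = ["KAAK", "KAKAKAKAK"]
quenched = quenched_probability(toy_msa, ("+", "+"))
pooled = pooled_probability(toy_msa, ("+", "+"))
```

In this toy alignment the first sequence contributes one (+,+) pair at separation 3 and the second contributes four pairs at separation 2. The quenched average weights the two sequences equally, whereas the pooled estimate is dominated by the sequence with more pairs.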

To assess the significance of SCE it is necessary to calculate its expected value if the pairs
are randomly distributed. For a protein family, with $L_{FAM}$ being the sequence length in the
MSA, let $N(i,j)$ be the total number of pairs of type $(i,j)$ and $l_{max}(i,j) = L_{FAM} - 1$ be the
maximum sequence separation between pairs of residues. If the pair occurs randomly, the
expected value of $P_{s_k}(i,j)$ is $P^{rand}_{s_k}(i,j) = \frac{1}{l_{max}(i,j)}$ provided $N(i,j) > l_{max}(i,j)$. Otherwise,
$P^{rand}_{s_k}(i,j) = \frac{1}{N(i,j)}$. To compare results for SCE from all families on equal footing, we use

$$S(i,j) = \frac{S_{SCE}(i,j)}{S^{rand}(i,j)} \times 100 \qquad (4)$$

where $S_{SCE}(i,j)$ is the SCE of Eqn. (1) and $S^{rand}(i,j) = -\ln(P^{rand}_{s_k}(i,j))$. If $S(i,j) = 0$ it implies that
amino acids $i$ and $j$ always occur at separation $s_k$ in all members of the family. A relatively small
value of $S(i,j)$ means that there is a preferred sequence separation $s_k$ for the pair.
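The normalization can be made concrete with a short Python sketch. The function names are hypothetical, and the treatment of the degenerate case in which the random reference entropy vanishes is an assumption, since the text does not discuss it.

```python
import math


def random_pair_probability(n_pairs, l_max):
    """Uniform reference probability: 1/l_max if N > l_max, otherwise 1/N."""
    if n_pairs < 1 or l_max < 1:
        raise ValueError("need at least one pair and one separation")
    return 1.0 / l_max if n_pairs > l_max else 1.0 / n_pairs


def normalized_sce(sce, n_pairs, l_max):
    """Express an SCE value as a percentage of its random expectation."""
    s_rand = -math.log(random_pair_probability(n_pairs, l_max))
    if s_rand == 0.0:
        # Assumption: with a single pair or a single separation the ratio is undefined.
        raise ValueError("random reference entropy is zero")
    return 100.0 * sce / s_rand


# An alignment of length 101 has l_max = 100. A family with 5000 pairs whose
# SCE equals ln(10) sits at half of the random value.
example = normalized_sce(math.log(10), n_pairs=5000, l_max=100)
```

A perfectly conserved separation gives an SCE of zero and therefore a normalized value of 0%, while pairs spread uniformly over all separations give 100%.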

Significant mixed runs of charged and hydrophobic/polar residues: Karlin et 
al. (Karlin, 1995; Karlin et al., 2002) found that, in five eukaryotic genomes, multiple long 
runs of given types of amino acids occur in proteins associated with diseases. For example, 
multiple long runs of glutamine, alanine, and serine dominate in Drosophila melanogaster
whereas in human sequences a preponderance of glutamate, proline, and leucine is found. 
Guided by these findings, we searched for significant mixed charged-hydrophobic (CH) or 
charged-polar (CP) runs. A mixed CH (CP) run is the longest possible segment of consecutive 
amino acids along the sequence such that the first and the last positions are occupied by 
charged amino acids while residues in between are either charged or hydrophobic (polar). If
$P_{random} = (P_+)^{n_+} (P_-)^{n_-} (P_H)^{n_H} L_{seq} < 10^{-3}$, where $P_+$, $P_-$ and $P_H$ are the fractions of
positively charged, negatively charged and hydrophobic (H) residues in the whole sequence,
$n_+$, $n_-$ and $n_H$ are the numbers of each such type of residue in the given run and $L_{seq}$ is the
length of the sequence, then a CH run is significant. Significance for (CP) runs is similarly
defined. Using the number of significant mixed CH runs ($N_{run}(CH)$) and CP runs ($N_{run}(CP)$) in
each sequence in the MSA whose real length is at least half of the length of the alignment, we
calculated the average number of significant mixed runs per sequence, $R_{run} = \frac{N_{run}}{N_{seq}}$;
$N_{run}$ is either $N_{run}(CH)$ or $N_{run}(CP)$, and $N_{seq}$ is the number of sequences in the family.
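The run definition and the significance test can be sketched in a few lines of Python. This is only an illustration. The hydrophobic residue set is an assumption (the text does not list it), the function name is hypothetical, and the sketch additionally assumes that a "mixed" run must contain at least one hydrophobic residue.

```python
import re

POSITIVE = "KR"
NEGATIVE = "DE"
HYDROPHOBIC = "AVLIMFWC"  # assumption: the source does not list its hydrophobic set

CHARGED = POSITIVE + NEGATIVE
# Longest stretch that starts and ends with a charged residue and contains
# only charged or hydrophobic residues in between.
RUN_PATTERN = re.compile(f"[{CHARGED}][{CHARGED}{HYDROPHOBIC}]*[{CHARGED}]")


def significant_ch_runs(sequence, threshold=1e-3):
    """Return (start, segment, p_random) for each significant mixed CH run."""
    length = len(sequence)
    if length == 0:
        return []
    frac = {name: sum(sequence.count(r) for r in group) / length
            for name, group in (("+", POSITIVE), ("-", NEGATIVE), ("H", HYDROPHOBIC))}
    runs = []
    for match in RUN_PATTERN.finditer(sequence):
        segment = match.group()
        n_pos = sum(segment.count(r) for r in POSITIVE)
        n_neg = sum(segment.count(r) for r in NEGATIVE)
        n_h = sum(segment.count(r) for r in HYDROPHOBIC)
        if n_h == 0:
            continue  # assumption: a "mixed" run needs a hydrophobic residue
        p_random = (frac["+"] ** n_pos) * (frac["-"] ** n_neg) * (frac["H"] ** n_h) * length
        if p_random < threshold:
            runs.append((match.start(), segment, p_random))
    return runs


# A nine-residue CH run embedded in a polar background of total length 100.
toy_sequence = "S" * 40 + "KLEVRLDAK" + "S" * 51
toy_runs = significant_ch_runs(toy_sequence)
```

Averaging `len(significant_ch_runs(seq))` over the sequences of a family would give the per-family quantity $R_{run}(CH)$; a CP version follows by swapping the hydrophobic set for a polar one.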



Database of aligned sequences: We used two types of sequence alignments:
(i) One obtained from the Pfam database (Bateman et al., 2002)
(http://www.sanger.ac.uk/Software/Pfam/index.shtml), which is composed of families of
protein domains; (ii) The other is produced by aligning a query sequence with similar sequences
using the Psi-Blast software (Altschul et al., 1990) (http://www.ncbi.nlm.nih.gov/BLAST) with
the default settings. In the February 2002 release Pfam had 3360 protein families that covered
68% of protein sequences. For some families (prions and globins) we used both the Pfam
database and the Blast alignment to check whether the two approaches lead to comparable
results. For other sequences (τ protein or Sup35), for which there is no known 3D structure, we
used only the Blast alignment.

To compute SCE, we considered a dataset of 839 families from Pfam (the list is available 
upon request). The criteria for choosing the families were: i) the average length of sequences 
in the family be at least 40 residues; ii) the number of members in the family be large so 
that an ensemble of pairs can be created; iii) the families give a good coverage of the 
various protein functions. The database of families ("All") consists of heat shock proteins 
(21, "HSP"), nucleic acids (DNA or RNA) binding proteins (152, "NA"), disease-related
proteins (prions, other amyloidogenic proteins, cancer, allergens, toxins) (40, "Disease"), viral
proteins (209, "Viruses") including viral nucleocapsid proteins (26, "Capsid") and "normal"
proteins (595, "Normal"). The number of families is given in parentheses. For example, the
families from the "Disease" class represent the subset of all the families retrieved from Pfam 
with keyword "disease" which satisfy also the above mentioned criteria for statistical signifi- 
cance. The functions of the families are diverse enough that we can draw meaningful conclusions. 



Implementation 



Disease associations based on clustering of sequence correlation entropy: The
P[S(i,j)] distributions for the (+,+), (+,-) and (-,-) pairs are broad (Fig. 1(a)). Therefore any
attempt to classify families based entirely on these distributions is bound to be arbitrary. A
more reliable method is to use a clustering procedure to divide the families according to their 
S(i, j) values. We start by constructing an 839 x 3 matrix with the rows representing the families
and the columns corresponding to S(i,j) values. Inspired by the Principal Component Analysis
(PCA) clustering procedure (Jolliffe, 1986), we transformed the above matrix into the 839 x 839
matrix of the Euclidean distances between all pairs of families. An analysis of the eigenvalues of
this matrix shows that the first 4-5 eigenvalues are much larger in magnitude than the others. 
Therefore, if there exists a tendency of the protein families to cluster then such a tendency will 
manifest itself in the behavior of the eigenvectors associated to the largest eigenvalues (because 
higher order eigenvectors are bound to remove structure from the data points). Indeed, the plot 
of the second eigenvector (EV2) versus the first eigenvector (EV1) (data not shown) reveals 
two clusters of data: one corresponding to positive values of EV2, the other corresponding 
to negative values of EV2. But the boundary between the two clusters is not well defined. 
A much better picture is provided by the plots of EV4 versus EV1 in Fig. 2(a) and EV4
versus EV2 in Fig. 2(b). Both graphs present three regions which we represent by filled
triangles, filled circles and stars. A mapping of the points from one graph to the other shows
that the corresponding regions are populated by almost the same set of points. But because the
number of points in the corresponding regions from the two graphs varies somewhat, we choose
to define the clusters based on Fig. 2(b). This choice leads to a more balanced distribution
of points in each cluster: the HC cluster contains 210 points, the MC cluster has 361 points,
and the remaining 268 points are in the LC cluster. By mapping the points from the three
clusters to families and their S(i,j) values, we can therefore classify the protein families in 3
classes: (1) Highly Correlated (HC) families have at least two of the pairs satisfying the constraints
S(+,+) < 52%, S(+,-) < 42%, and S(-,-) < 50%. (2) If at least two of the pairs satisfy
S(+,+) = (53% - 63%), S(+,-) = (43% - 54%) and S(-,-) = (51% - 60%) then the family
is considered to have moderate correlation (MC). (3) When at least two of the pairs lie outside
the range, we assume that there is little correlation (LC).
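The two steps of this procedure, the eigen-analysis of the distance matrix and the threshold-based class assignment, can be sketched as follows. The sketch assumes NumPy is available. The function names are hypothetical, and two choices are assumptions: eigenvectors are ordered by the magnitude of their eigenvalues, and a family falling in the narrow gaps between the quoted ranges (for example S(+,+) between 52% and 53%) is not counted as satisfying either constraint.

```python
import numpy as np


def distance_eigensystem(s_values):
    """Eigen-decomposition of the family-by-family Euclidean distance matrix.

    `s_values` has one row per family and three columns:
    S(+,+), S(+,-) and S(-,-), in percent.
    """
    x = np.asarray(s_values, dtype=float)
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # symmetric, zero diagonal
    eigvals, eigvecs = np.linalg.eigh(dist)
    order = np.argsort(-np.abs(eigvals))       # largest magnitude first
    return eigvals[order], eigvecs[:, order]


HC_UPPER = (52.0, 42.0, 50.0)
MC_RANGE = ((53.0, 63.0), (43.0, 54.0), (51.0, 60.0))


def classify_family(s_pp, s_pm, s_mm):
    """Assign HC, MC or LC from the three normalized SCE values (percent)."""
    values = (s_pp, s_pm, s_mm)
    if sum(v < hi for v, hi in zip(values, HC_UPPER)) >= 2:
        return "HC"
    if sum(lo <= v <= hi for v, (lo, hi) in zip(values, MC_RANGE)) >= 2:
        return "MC"
    return "LC"


# Four made-up families: two with low S values and two with high S values.
toy_families = [[40, 35, 45], [41, 36, 44], [70, 60, 80], [69, 61, 79]]
toy_vals, toy_vecs = distance_eigensystem(toy_families)
toy_classes = [classify_family(*row) for row in toy_families]
```

Plotting the columns of the returned eigenvector matrix against each other (for example the fourth against the second) corresponds to the scatter plots from which the clusters are read off.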

Many protein families known to be associated with various diseases belong to the HC class
(Table 1). Examples of families belonging to the HC class are prions, Aβ peptide, the τ
protein and Sup35 (one of the yeast prions), which are all known to aggregate and form fibrils.
This result correlates well with the finding of various studies of protein aggregation that, for
prions, charged residues play a key role (Billeter et al., 1995). Our prediction supports the
observation that mutations of charged residues drastically affect the fibrillization kinetics in a
variety of proteins in which aggregation occurs from the unfolded state (Massi et al., 2002; Chiti
et al., 2003). Other families with a high degree of sequence correlation between charged amino
acids represent disease-causing proteins such as those of viruses (HCV, Adeno-hexon, Vpu, Gag-p17),
Androgen receptor (Kennedy disease), the Lyme disease protein and P53 (whose malfunctioning
is linked to cancer). The HC class also contains proteins which bind nucleic acids (DNA and RNA,
which are highly charged), like DNA-polB, recA and IF3, together with proteins that are implicated
in the response of the organism to environmental stress (PAL). The largest category of proteins
represented in the HC class is that associated with various diseases (Table 1). Families that belong
to the MC class represent a mixture of structural proteins, enzymes, transport proteins and some
disease-related proteins. Some examples are: Aerolysin (related to deep wound infections), aldedh
(allergens), Alpha_TIF (viral protein), Clathrin (structural protein in vesicles), FMN_dh (electron
transfer), and GCV_H (glycine cleavage H-protein).

Three control calculations are needed to ascertain if the classification of protein families based 
on the clustering of SCE is reliable. 

(a) It is likely that the high sequence correlation among charged residues can arise because 
of unusual sequence composition in disease related proteins. If the fraction of total number of
charged residues in a sequence exceeds the typical 23% observed in protein structures (Creighton, 
1993), then one might expect it to belong to the HC class. Computation of the sequence 
composition of charged residues in all 839 families shows no correlation between the observed 
fraction of charged residues and the associated value of S(i, j) (RID, unpublished). 

(b) To ascertain if our findings are a result of high sequence identity, we explored the
relationship between the average sequence identity in a family (as presented in the Pfam
entry) and its class based on S(i,j) values. The distributions of sequence identities for
families belonging to the three classes show that families with similarities above 90% belong
to the HC class, while families with similarities below 15% are most likely to belong to
the LC class. This is exactly what one expects from the definitions of the SCE and of
$P_{s_k}(i,j)$. But, in general, there is no good correlation between the sequence similarity in a
family and its class (Fig. 1(b)). There is considerable variation in the sequence identities between
families in both the HC and MC classes (Fig. 1(b)). Both these control calculations show that the
values of SCE among charged residues may indeed be associated with the function of the protein.

(c) To determine the significance of the S(i,j) values in the various families, a comparative
analysis with a random dataset is required. For this, we built 100,000 sets each containing
1000 sequences of length 100. Each position in a sequence was assigned one of the 20 types of
amino acids with equal probability. The corresponding distributions of S(i,j) values (data not
shown) are narrow with averages corresponding to 69%, 57% and 76% for S(+,+), S(+,-),
and S(-,-) respectively. Using these data sets, the P-values for S(i,j) in the HC class are
< $10^{-5}$, which shows that the calculated S(i,j) values are very significant. More importantly,
the link between S(i,j) and the tendency to aggregate is indeed meaningful.
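The random control can be sketched in Python as follows. The function names are hypothetical, and the P-value convention (the fraction of random sets whose S value is at or below the observed one) is an assumption, since the text does not spell out how the P-values were obtained.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_family(n_sequences=1000, length=100, rng=None):
    """Sequences in which every position is drawn uniformly from the 20 amino acids."""
    rng = rng or random.Random()
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
            for _ in range(n_sequences)]


def empirical_p_value(observed, null_values):
    """Fraction of null S values that are as low as, or lower than, `observed`."""
    null_values = list(null_values)
    if not null_values:
        raise ValueError("the null distribution is empty")
    return sum(v <= observed for v in null_values) / len(null_values)


# Small illustrative call; the study itself used 100,000 sets of 1000 sequences.
toy_family = random_family(n_sequences=5, length=20, rng=random.Random(0))
toy_p = empirical_p_value(50.0, [69.0, 70.0, 68.0, 71.0])
```

With 100,000 random sets, a reported P-value below $10^{-5}$ under this convention would mean that no random set reached an S value as low as the observed one.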

Specific sequence separations in charged residues may be preferred in proteins
belonging to the HC class: Plots of $P_{s_k}(i,j)$ as a function of $s_k$ for three of the families in
the HC class show (Fig. 3) that there is considerable structure in $P_{s_k}(i,j)$, which implies that
in these families there is distinct correlation among charged residues at preferred values of $s_k$.
Surprisingly, despite the similarity in the overall degree of correlation in prions, DNA-polB
and HCV capsid proteins, the behavior of $P_{s_k}(i,j)$ in prions resembles more closely that of the
HCV capsid: in both cases there are a few peaks separated by deep valleys. On the other hand,
in DNA-polB $P_{s_k}(i,j)$ decays smoothly with the increase in $s_k$.

The availability of a representative structure in these families allows us to relate these high
probability sequence correlations (Fig. 3) to their occurrence in the structure. Mapping onto
the NMR structure of the human prion protein (1QLX) of the positions that are involved in the
pairs that correspond to the largest $P_{s_k}(i,j)$ (Fig. 4(a)) shows that these 30 positions (which are
mostly found in the 3 helices) are clustered almost entirely on one face of the three dimensional
prion structure. In the prion family the localization of charged amino acids in the 3D structure
is reflected in the specific peaks in $P_{s_k}(i,j)$. A similar mapping of positions (selected on the
same basis as in prions) in DNA-polB on a structure from Thermococcus gorgonarius (1TGO)
shows that these positions are uniformly distributed on all faces of the structure. If the linear
density of charges (number of charged residues divided by sequence length) is roughly uniform,
we expect $P_{s_k}(i,j)$ to decay smoothly without any structure. This is the case in the DNA-polB
family (Fig. 3). As a result we find that, at the tertiary structure level, the charges are roughly
uniformly distributed throughout the surface (Fig. 4(b)).

Could the observation of preferred sequence separation be anticipated from sequence entropy
calculation alone? To answer this, we calculated $S(i) = -\sum_{a=1}^{N_{class}} p_a(i) \ln p_a(i)$ using four classes
($N_{class} = 4$) of amino acids (positively and negatively charged, polar, hydrophobic), where
$p_a(i)$ is the probability of observing amino acid class $a$ at site $i$ in a MSA ($p_a(i) = \frac{1}{4}$ from random
considerations). We only discuss the prion family because in this family many positions are
strongly conserved. There are a total of 35 charged residues in hPrP (23-253) out of which
14 are perfectly conserved ($S(i) = 0$). In the structured region of hPrP only 6 out of the 14
appear in pairs with high degree of correlation. Thus, there is no obvious relationship between
conservation of individual charged residues and their covariation as measured by SCE.
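The per-site entropy over four residue classes can be sketched in Python. The assignment of the uncharged amino acids to the polar and hydrophobic classes is an assumption (the text names the classes but not their members), gaps and unknown symbols are skipped, and the function names are hypothetical.

```python
import math
from collections import Counter

# Assumed class membership; only the charged classes are fixed by the text.
CLASS_MEMBERS = {
    "positive": "KR",
    "negative": "DE",
    "polar": "STNQHYCG",
    "hydrophobic": "AVLIMFWP",
}
RESIDUE_CLASS = {res: label
                 for label, members in CLASS_MEMBERS.items()
                 for res in members}


def site_entropy(column):
    """Entropy (natural log) of the class distribution in one alignment column."""
    counts = Counter(RESIDUE_CLASS[res] for res in column if res in RESIDUE_CLASS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column contains no classifiable residue")
    return 0.0 - sum((n / total) * math.log(n / total) for n in counts.values())


def alignment_entropies(msa):
    """Site entropy for every column of an alignment of equal-length sequences."""
    return [site_entropy(column) for column in zip(*msa)]


toy_entropies = alignment_entropies(["KD", "RS"])
```

A column occupied only by K and R gives zero entropy because both belong to the positive class, whereas a column with one residue from each of the four classes gives the maximum value ln 4.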

Link between number of significant mixed runs and SCE: The distribution of
$R_{run}(CH)$ for the protein families in the three classes (HC, MC, LC) shows clear differences
between them (data not shown). In proteins belonging to the HC class (104 members) there is
a long tail in the distribution of $R_{run}(CH)$, which implies that a large number of mixed (CH)
runs are found. In the HC class we find that a significant number of proteins have $R_{run}(CH) > 3$
whereas the maximum value is $R_{run}(CH) < 3$ for protein families belonging to the MC class
(397 members). In the LC class (338 members) the number of significant mixed (CH) runs
hardly exceeds 2. Overall, about 62% of families in the HC class have $R_{run}(CH) > 1$. On the other
hand only 43% of the families in the MC class have $R_{run}(CH) > 1$ whereas only about 10% of
families in the LC class have $R_{run}(CH) > 1$.

The results for the distribution of $R_{run}(CP)$ show even more dramatic differences between
the three classes. About 21% of families in the HC class have $R_{run}(CP) > 1$, whereas only
about 2% of the MC class families have multiple (> 1) significant mixed (CP) runs. Among the
proteins in the LC class we do not find any protein family with $R_{run}(CP) > 1$. The percentage
of families with either $R_{run}(CH) > 1$ or $R_{run}(CP) > 1$ is 69%, 44%, and 10% in the HC, MC,
and LC class respectively. These results show that there is a significant correlation between the
number of mixed charged runs and the SCE.

Examples of families with $R_{run}(CH) > 1$ are aldedh (Aldehyde dehydrogenase, allergens),
Basic (myogenic Basic domain, DNA binding with bHLH motif), Bet_v_I (pathogenesis-related
protein Bet v I family, allergens), endotoxin (bacterial toxins), bZIP_MAF (bZIP
MAF transcription factor, cancer-related), Granin (cancer-related), Myc_N_term (cancer-related),
prions, Arena_nucleocapsid, Bunya_RdRp (Bunyavirus RNA-dependent RNA polymerase),
Corona_nucleocapsid, DNA_polB, Flu_PB1 (Influenza virus RNA-dependent RNA
polymerase subunit PB1), Hanta_nucleocapsid, Paramyxo_ncap (Paramyxovirus nucleocapsid
protein), RHD (cancer-related), Tropomyosin (allergens), TTL (breast cancer related),
GroEL, HSP90, P53 and actin. $R_{run}(CP) > 1$ is found in Androgen_recep (Kennedy disease),
HSP90, HCV_capsid (Hepatitis C virus nucleocapsid protein), Granin, Myc_N_term, P53,
Corona_nucleocapsid, Astro_capsid (Astrovirus nucleocapsid precursor), Arte_nucleocapsid
(Arterivirus nucleocapsid protein), and Paramyxo_ncap.

We note that, in general, $R_{run}(CP) > 1$ occurs only in families of proteins associated with
diseases, while $R_{run}(CH) > 1$ is found in both families of normal proteins and of proteins
associated with diseases. Even more interestingly, this analysis reveals differences between
disease-related proteins which might be the reflection of the corresponding disease mechanism:
proteins found in allergens and toxins and the majority of proteins related to cancer have large
$R_{run}(CH)$ values and small $R_{run}(CP)$ values, while the protein related to Kennedy disease and
some of the viral nucleocapsid proteins have large $R_{run}(CP)$ values and small $R_{run}(CH)$ values.

A summary of the major findings is given in Table 2. There are a number of lessons that can
be gleaned from our findings: (1) Only about 7% of all normal proteins have high correlation 
among charged residues compared to 25% of all proteins. However, among the 41 "normal" 
protein families (that are in the HC class) 28 (68%) have at least one significant (CH) or (CP) 
run. (2) The largest percentage of proteins in the HC class is from viral nucleocapsid protein 
families. These proteins, which are involved in transcription, also have a number of significant 
(CH) or (CP) runs. (3) The families of proteins that bind to nucleic acids in the HC class have 
the highest percentage of combined (CH) and (CP) runs. Taken together these results show 
that, for all families, relatively low values of SCE are linked to the number of significant (CH) 
and/or (CP) runs. 

Discussion 

Comparison with other methods for extraction of sequence correlations: In the
Methods section we noted that a quenched average over the sequences in the MSA can reveal
novel correlations between residues (Eqn. (2)). To ascertain if similar inferences can be drawn
using other methods we performed a Mutual Information Function (MIF) (Li, 1990) analysis of
pairs of charged residues in the various protein families. We first determined the probability to
find a charged residue (type C1) at a given position $i$ in a MSA,

$$P_i(C1) = \frac{1}{N_{seq}} \sum_{l=1}^{N_{seq}} \delta([i] - C1) \qquad (5)$$

where the sum runs over the $N_{seq}$ sequences in the MSA and $\delta([i] - C1)$ is unity if the amino acid
at position $i$ is type C1 and is zero otherwise. The probability of finding a pair of charged
residues (C1 and C2) at two positions ($i$ and $j$) in the MSA is

$$P_{ij}(C1,C2) = \frac{1}{N_{seq}} \sum_{l=1}^{N_{seq}} \delta([i] - C1)\, \delta([j] - C2) \qquad (6)$$

The MIF is

$$MIF(C1,C2) = \sum_{i=1}^{L_{FAM}-1} \sum_{j=i+1}^{L_{FAM}} P_{ij}(C1,C2) \ln\left(\frac{P_{ij}(C1,C2)}{P_i(C1) P_j(C2)}\right) \qquad (7)$$

Just as before, we need to factor out the effects of the length of each MSA, so we measured the
quantity:

$$MIF^*(C1,C2) = \frac{MIF(C1,C2)}{\ln(L_{FAM})} \qquad (8)$$

The MIF*(C1,C2) values for all 839 families are much smaller than the maximum possible
value ($MIF_{max}(C1,C2) = 4.4$, because $MIF_{max}(C1,C2) = \ln\left(\frac{1}{P(C1) P(C2)}\right)$ and $P(C1) = 0.11$).
The lack of any discernible structure (which is indicative of correlations among the chosen
residues) in the MIF*(C1,C2) values is likely the result of pre-averaging over the sequences
in the MSA. Our method, which does not use preaveraging (see Eqn. (2)), is therefore able to
capture correlations that are not easily detected by the MIF approach. It is worth noting that
MIF is suitable for many practical applications. A combination of methods might be required
to obtain correlations between residues using sequence information alone.
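For comparison, a position-based mutual-information sum of the kind described above can be sketched in Python. This is an illustration under stated assumptions rather than the calculation actually performed. The function names are hypothetical, the probabilities are simple column frequencies over the sequences of the alignment, terms with a vanishing pair probability are taken to contribute zero, and only ordered position pairs with $i < j$ are included.

```python
import math

POSITIVE = set("KR")
NEGATIVE = set("DE")


def charge_of(residue):
    """Return '+', '-' or None for a one-letter amino-acid code."""
    if residue in POSITIVE:
        return "+"
    if residue in NEGATIVE:
        return "-"
    return None


def mutual_information(msa, c1, c2):
    """Sum over position pairs i < j of P_ij * ln(P_ij / (P_i * P_j))."""
    n_seq = len(msa)
    length = len(msa[0])
    # Column frequencies of charge type c1 and of charge type c2.
    p1 = [sum(charge_of(seq[i]) == c1 for seq in msa) / n_seq for i in range(length)]
    p2 = [sum(charge_of(seq[j]) == c2 for seq in msa) / n_seq for j in range(length)]
    total = 0.0
    for i in range(length - 1):
        for j in range(i + 1, length):
            p_ij = sum(charge_of(seq[i]) == c1 and charge_of(seq[j]) == c2
                       for seq in msa) / n_seq
            if p_ij > 0.0:  # convention: 0 * ln(0) = 0
                total += p_ij * math.log(p_ij / (p1[i] * p2[j]))
    return total


def normalized_mutual_information(msa, c1, c2):
    """Divide by ln(alignment length) to factor out the length of the MSA."""
    return mutual_information(msa, c1, c2) / math.log(len(msa[0]))


independent = mutual_information(["KD", "KA", "AD", "AA"], "+", "-")
correlated = normalized_mutual_information(["KD", "KD", "AA", "AA"], "+", "-")
```

Because the frequencies are tied to fixed alignment columns, a pair that keeps its separation but shifts along the sequence from one family member to the next contributes to different columns and is diluted, which is the pre-averaging effect discussed in the text.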



Plausible functional link among some families in the HC class: The class of proteins
that gives the most clear and consistent signal (relatively low values of SCE and multiple significant
(CH) and (CP) runs) is that associated with disease-related proteins like prions, viruses and
P53. Sequence-level correlations could be the result of the constraints imposed on the evolution
of the protein by its function in the cell. Using these observations we tentatively hypothesize that
protein families with a high degree of charge correlations may have somewhat similar functions.
If this hypothesis is valid, we can surmise that there may be some level of similarity between
the actions of prions and those of nucleocapsid viral proteins. Similarly, the functions of prions
and P53 may be somewhat related. The function of prions is not known, but those of P53 and
nucleocapsid proteins are: they both bind DNA, with nucleocapsid proteins playing a vital role
in the transcription and replication of viral DNA/RNA. Our hypothesis would suggest that the
cellular form of prions can also bind nucleic acids. Because prions resemble the HCV nucleocapsid
proteins even at the level of individual $P_{s_k}(i,j)$ (Fig. 3), we propose that the function
of prions could be similar to that of nucleocapsid proteins.

There is experimental support for our inference that the functions of prions and nucleocapsid
proteins may be similar. A series of studies (Sklaviadis et al., 1993; Cordeiro et
al., 2001; Gabus et al., 2001a; Gabus et al., 2001b; Moscardini et al., 2002) have shown
that the prion protein has DNA strand transfer properties similar to viral nucleocapsid
proteins. It has been postulated that in prions an unknown cofactor ("Protein X") facilitates
the dramatic conformational transition from the predominantly α-helical structure to a state
rich in β-sheet. DNA strands could play the role of "Protein X" in the conformational transition.

Number of significant (CH)/(CP) runs and propensity for scrapie formation in
prions are linked: Recent sequence and structural analysis (Kallberg et al., 2001; Dima and
Thirumalai, 2002) has suggested that elements of secondary structure in mammalian prions are
frustrated. By frustration we mean that the secondary structure elements in the normal cellular
form could be better accommodated by an alternative conformation. Avian prion sequences
are not frustrated (Dima and Thirumalai, 2002), which explains the lack of observation of the
scrapie form in these species. These findings are further corroborated here by the variations
in the significant (CH) and (CP) runs (Table 3) between these species. Despite the lack of
significant differences (see Table 3) in the amino acid composition (especially in charged and
polar residues) among the various species, the chicken prion sequence, like those of the other
avian species, does not have significant mixed charged-hydrophobic/polar runs. However, (CH)
and/or (CP) runs appear in all mammalian prions which are known to be associated with prion
diseases. The absence of such runs might be one of the reasons for the lack of formation of the
scrapie form of prions in avian species.

Despite the deleterious effects of sequence correlations among charged residues in proteins 
associated with diseases, our findings suggest that charged amino acids must play an important 
role in the functions of these proteins. Viruses (like HIV and hepatitis C) have a high degree 
of sequence correlation between charges. Blocking the repertoire of charges in viruses might 
impair their capacity to induce and promote the associated disease. As a rule, protein families 
with a high degree of sequence correlation also have a significant number of mixed 
charged-hydrophobic/polar runs. When charged residues are correlated in a sequence, they are 
likely to be distributed in patches in the three dimensional structure. 
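
The notion of charge correlation at a given sequence separation can be made concrete with a minimal sketch. The Python code below is not the implementation used in this work: it only counts, for one sequence, how often charged pairs of type (+,+), (+,-), or (-,-) occur at each separation and normalizes the counts within each pair type. The residue sets (K and R as positive, D and E as negative, histidine ignored), the function names, and the plain averaging over the sequences of a family are all assumptions made for illustration.

```python
from collections import Counter

# Assumed charge assignment for illustration; histidine is ignored.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def charge_sign(residue):
    """Return '+', '-', or None for an uncharged residue."""
    if residue in POSITIVE:
        return "+"
    if residue in NEGATIVE:
        return "-"
    return None


def pair_separation_frequencies(sequence):
    """Map (pair_type, separation) to the fraction of pairs of that type.

    pair_type is '++', '+-', or '--'; separation is counted in residues.
    """
    charged = []
    for position, residue in enumerate(sequence):
        sign = charge_sign(residue)
        if sign is not None:
            charged.append((position, sign))

    counts = Counter()
    totals = Counter()
    for a in range(len(charged)):
        for b in range(a + 1, len(charged)):
            pos_a, sign_a = charged[a]
            pos_b, sign_b = charged[b]
            # Sorting makes '-+' and '+-' the same pair type.
            pair_type = "".join(sorted(sign_a + sign_b))
            counts[(pair_type, pos_b - pos_a)] += 1
            totals[pair_type] += 1

    return {key: n / totals[key[0]] for key, n in counts.items()}


def family_average(sequences):
    """Average the per-sequence fractions over a family (assumed scheme)."""
    summed = Counter()
    for sequence in sequences:
        for key, value in pair_separation_frequencies(sequence).items():
            summed[key] += value
    return {key: value / len(sequences) for key, value in summed.items()}
```

In such a profile, a pronounced peak at one separation for a given pair type would be the signature of the enhanced sequence correlation discussed above, whereas a flat profile would indicate weakly correlated charges.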

Given the potential link between a high degree of charge correlation and disease, it is not 
clear why these proteins have not evolved more benign characteristics. Perhaps there 
are functional demands on this class of proteins that require multiple runs and significant 
charge correlations. The high degree of charge correlation and the presence of mixed (CH) 
and/or (CP) runs might be indicative of their role in protein-protein interactions or binding 
to DNA or RNA. It is also likely that the charge distributions are important in 
avoiding aggregation. As a corollary we find that the majority of "normal" proteins exhibit 
only moderate or weak sequence correlation between charges. The identification of correlated 
charged pairs in various families suggests that mutation of these residues can compromise 
function. This prediction is amenable to experimental tests. 

Importance of charged residues in kinetics of fibrillization: The factors that affect 
amyloid deposition rates have not been fully elucidated. Only recently has a systematic physical 
basis relating sequence characteristics of disease-related proteins to amyloid formation 
been explored (Chiti et al., 2003). This study shows that the overall charge state greatly 
affects fibrillization kinetics: the deposition rate decreases as the overall charge of the 
disease-related protein increases. Similarly, we had argued (Thirumalai et al., 2003) that in both 
prion proteins and Aβ peptides the overall charge is relevant for polymerization. For example, 
the Aβ16-22 peptide, with the sequence KLVFFAE, is a significant (CH) run. This peptide has 
been shown by solid state NMR measurements to readily aggregate into amyloid fibrils 
organized into antiparallel β-sheets (Balbach et al., 2000). Extensive molecular dynamics 
simulations in explicit solvent probing the dynamics of the assembly process for three Aβ16-22 
peptides (Klimov & Thirumalai, 2003) revealed that electrostatic and hydrophobic interactions 
play different roles in the formation of the antiparallel β-sheet: the electrostatic interactions 
play a crucial role in orienting the peptides in the oligomer, while the hydrophobic 
interactions bind the peptides together. As a result, mutations either at the C-terminal end, from 
a negatively charged residue to polar residues (E22G and E22Q), or at the middle hydrophobic 
positions (L17S/F19S/F20S) considerably reduce the stability of the oligomer. These studies 
provide additional support for our prediction that charged residues clustered into significant 
(CH) runs play an important part in the dynamics of protein aggregation. The prediction, 
based solely on bioinformatic analysis, that correlated mutations can inhibit amyloid formation 
can be experimentally tested. 
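
As an illustration of why KLVFFAE counts as a mixed charged-hydrophobic segment while the mutants mentioned above do not, the following minimal Python sketch labels each residue as charged (C), hydrophobic (H), or polar (P). It is only a sketch under stated assumptions: the residue classes (K, R, D, E as charged; A, V, L, I, M, F, W, C as hydrophobic; everything else lumped as polar) and the function names are hypothetical choices for this example, and the statistical criterion that makes a run "significant" is not reproduced here.

```python
# Assumed residue classes for illustration only.
CHARGED = set("KRDE")
HYDROPHOBIC = set("AVLIMFWC")


def residue_class(residue):
    """Return 'C' (charged), 'H' (hydrophobic), or 'P' (all other residues)."""
    if residue in CHARGED:
        return "C"
    if residue in HYDROPHOBIC:
        return "H"
    return "P"


def class_string(peptide):
    """Translate a peptide into its string of residue classes."""
    return "".join(residue_class(residue) for residue in peptide)


def is_mixed_ch_segment(peptide):
    """True if the peptide contains only charged and hydrophobic residues,
    with at least one residue of each kind."""
    return set(class_string(peptide)) == {"C", "H"}


wild_type = "KLVFFAE"       # Abeta 16-22
e22q_mutant = "KLVFFAQ"     # E22Q
triple_mutant = "KSVSSAE"   # L17S/F19S/F20S

for peptide in (wild_type, e22q_mutant, triple_mutant):
    print(peptide, class_string(peptide), is_mixed_ch_segment(peptide))
```

Under these assumed classes only the wild-type peptide consists purely of charged and hydrophobic residues; both the terminal and the central substitutions introduce polar residues and break the mixed (CH) pattern.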

TABLE I: Sequence correlation entropy for charged residues in protein families from the HC class 

a The sequence identity in the family. 
b Sequence correlation entropy, expressed as a percentage (see the equation defining the SCE), for (+,+) pairs. 
c Same as b, except it is for (+,-) pairs. 
d Same as b, except it is for (-,-) pairs. 



TABLE II: Link between sequence correlations and the number of mixed runs for the 839 families 

Family a    HC b    MC b    LC b    (CH) c    (CP) d    run e 
All         25      43      32      62        21        69 
Normal      58      39      3       53                  67 

a The first column describes the functions of the families in the data set (see text for an explanation). The 450 members in the "Normal" category exclude the "NA" and "HSP" families. 
b Percentages of protein families in the Highly Correlated (HC), Moderately Correlated (MC), and Loosely Correlated (LC) classes. The largest and the smallest percentages in each column are given in bold and in italics respectively. 
c The percentage of families, among the HC class, that have at least one significant mixed charged-hydrophobic run. 
d Same as c, except it is for mixed charged-polar runs. 
e Percentage of proteins, in the HC class, that have either one (or more) significant (CH) or one (or more) significant (CP) run(s). 



TABLE III: Amino acid composition and number of mixed runs in prion proteins 



Species 
Human 
33 
36 
15 
49 

a Percentage of hydrophobic residues. 
b Percentage of charged residues. 
c Percentage of polar residues. 
d Number of amino acids. 
e Number of significant (CH) runs. 
f Number of significant (CP) runs. 



Figure Captions 



Fig. 1 (a): Histograms of S(i,j) (see the equation defining S(i,j) in the text) for the 839 families. The top, middle, and bottom 
panels are for (-,-), (+,-), and (+,+) respectively. (b): The distribution of average sequence 
identity (SI) (as presented in the Pfam file of the protein family) for the families in our dataset. 
Each family was classified according to its class (HC, MC or LC) as described in the text. 

Fig. 2 (a): The plot of the first and fourth eigenvectors of the matrix of Euclidean distances 
between pairs of families (see text for details). (b): The plot of the second and fourth eigenvectors 
of the distance matrix. The three distinct regions in these graphs allow the classification of 
families into three classes: HC, MC, and LC (see text for details). 

Fig. 3: The probability, P_sk(i,j), of finding residues i and j at separation s_k as a function of 
ln(s_k) for the prion, HCV capsid and DNA-polB families. (a) (i,j) = (+,+), (b) (i,j) = (+,-), 
(c) (i,j) = (-,-). Except for DNA-polB, the presence of peaks in P_sk(i,j) suggests enhanced 
sequence correlation at specific separations of charges. 

Fig. 4: Mapping of highly correlated charged residues, as measured by P_sk(i,j) (see Fig. 3), 
onto the three dimensional structure. (a) Space filling representation of one face of the human 
prion structure (1QLX). Residues appearing in pairs corresponding to the peaks in P_sk(i,j) 
(Fig. 3) are shown in black. Highly correlated charged residues in sequence space are localized 
in the three dimensional structure. (b) Similar picture for 1TGO, which is a representative 
structure for DNA-polB. The black shades represent residues that have significant values of 
P_sk(i,j) (Fig. 3). Charged residues are uniformly distributed in the three dimensional structure. 